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Abstract 

Background: Immunosignaturing is a new peptide microarray based technology for profiling of humoral immune 
responses. Despite new challenges, immunosignaturing gives us the opportunity to explore new and 
fundamentally different research questions. In addition to classifying samples based on disease status, the complex 
patterns and latent factors underlying immunosignatures, which we attempt to model, may have a diverse range 
of applications. 

Methods: We investigate the utility of a number of statistical methods to determine model performance and 
address challenges inherent in analyzing immunosignatures. Some of these methods include exploratory and 
confirmatory factor analyses, classical significance testing, structural equation and mixture modeling. 

Results: We demonstrate an ability to classify samples based on disease status and show that immunosignaturing 
is a very promising technology for screening and presymptomatic screening of disease. In addition, we are able to 
model complex patterns and latent factors underlying immunosignatures. These latent factors may serve as 
biomarkers for disease and may play a key role in a bioinformatic method for antibody discovery. 

Conclusion: Based on this research, we lay out an analytic framework illustrating how immunosignatures may be 
useful as a general method for screening and presymptomatic screening of disease as well as antibody discovery. 
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Background 

The human immune system is a rich source of informa- 
tion about the health and disease status of an individual 
[1-4]. Immunosignaturing is a new technology that may 
be useful to decode the vast amounts of health informa- 
tion contained in the immune system. An immunosigna- 
ture is a pattern containing multiplexed signals from 
chronic or recently matured antibodies. These signals 
come from a sufficiently diverse set of peptide targets 
on a microarray. Thousands of peptides of random 
sequence (mimotopes) provide the density and diversity 
sufficient to discriminate different diseases. An initial 
question, and the aim of this paper, is how best to ana- 
lyze and decode the information from immunosignatur- 
ing studies. Previous reports [1-3] used frequentist 
statistics (ANOVA or t-test) and cluster analysis (hier- 
archical clustering and Principal Components) to 
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identify features that classify disease states. We examine 
other methods that may yield better performance in 
immunosignature analyses. Corrected T- Tests as well as 
logistic and multinomial logistic regression models have 
demonstrated an ability to differentiate between patients 
with different disease states even after stringent correc- 
tions for running multiple statistical tests (alpha infla- 
tion). Confirmatory factor analysis is an additional 
method which provides an abundance of information 
relating to the clustering of samples as well as providing 
an alternative method for categorizing and determining 
the disease state of a single sample. Descriptive statistics 
help to paint a better picture of the overall immune sys- 
tem activity. Finally, structural equation modeling and 
mixture models can help explain the underlying struc- 
ture of an immunosignature. 

For these analyses we examined a dataset containing 
breast cancer samples along with patients who had a 
second primary tumor (not a recurrence). The group 
with a second primary tumor was included in the ana- 
lyses because if these patients could be diagnosed as 
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having a high probability of developing a second tumor, 
they could be more closely monitored. 

In an immunosignaturing study, sera samples are col- 
lected from participants and the physical information 
from the immune system is extracted using high density 
peptide microarrays. Each microarray contains a large 
number of peptides; in this case 10,375 peptides. The 
selection of these peptides was designed to give broad 
spectrum coverage of relevant antigens in the human 
immune system. The relevant nature of each peptide 
capitalized on early phage display research [1]. The deci- 
sion was made to use a peptide microarray instead of 
phage library panning because of the increased speed 
and efficiency offered by a peptide microarray [1]. Ide- 
ally, if we can better understand the information cap- 
tured by the peptide microarrays we may be able to 
develop quick, accurate, unobtrusive and inexpensive 
screening tests for many types of disease. 

Classic peptide microarrays are created by spotting 
overlapping peptides corresponding to linear sequences 
of proteins known to be involved in an infectious dis- 
ease. These arrays cannot identify non-linear epitopes. 
The epitopes are identified when B-cells produce antibo- 
dies (usually IgG) specific to 8-12 residue peptides that 
are components of the antigen protein. In contrast, 
immunosignaturing arrays utilize random-sequence pep- 
tides. Random sequence peptides have some specific and 
reproducible affinity to antibodies, and determining the 
level and pattern of binding is core to determining the 
difference between patients with different diseases. 

Although much research has been done on statistical 
analyses using microarrays, immunosignaturing microar- 
rays pose a number of novel challenges not encountered 
in traditional microarrays. In nucleic acid microarray 
technologies, binding is essentially only between two 
types of molecules of complementary sequence. For 
example, in a genotype array, genomic DNA binds to 
complementary nucleic acid probes that have either 
matches (e.g., perfect match, PM) or mismatches (MM) 
and the signals from the different probes are combined 
to make homozygous and heterozygous base calls for 
individual single nucleotide polymorphisms (SNPs). In a 
gene expression microarray, only a specific fragment of 
RNA will bind to the oligonucleotide on the array. With 
modern microarrays, as long as there is a sufficient 
abundance of RNA on the array, it will generally bind 
only to the specific complementary probe, with very lim- 
ited non-specific binding. 

With immunosignaturing microarrays, the intensity 
values are a continuous value from 0-65,000 and binding 
is not restricted to a single "complementary" molecule. 
Multiple antibodies in IgG could bind to the same 
20mer peptide on the array. Also, although the immu- 
nosignaturing arrays are designed to measure IgG, there 



may still be competitive binding from other material in 
the sera and from other types of immunoglobulin. Com- 
petitive binding could result in an IgG antibody not 
binding at all or binding with a lower affinity. This 
could be potentially problematic if the auxiliary particle 
reducing binding affinity does not differ systematically 
across groups. Furthermore, a single antibody may also 
bind to multiple peptides on the array; a problem almost 
non-existent in genotype or gene expression arrays. 

With the potential for so many different things to 
bind to a peptide on the array, it is not immediately 
clear how accurately traditional and more novel statisti- 
cal methods would perform. One primary goal of the 
research reported here was to determine if the proposed 
statistical methods were capable of effectively analyzing 
the data and producing a correct pattern of results. For 
example, with a number of different things binding to a 
peptide and antibodies binding to multiple peptides it 
was initially uncertain if this would produce erratic sig- 
natures which would lead to incorrect results when cer- 
tain methods were used. 

Despite a number of new complexities created by 
immunosignaturing microarrays, these challenges give 
us the opportunity to test the performance of classically 
used methods such as factor analysis models in a differ- 
ent environment while also allowing us to ask new and 
fundamentally different research questions. In order to 
answer these new research questions, there is a need to 
use different statistical models not commonly used to 
analyze microarray data. This is because more tradi- 
tional models used to analyze microarrays lack the ver- 
satility to adequately capture and explain the 
complexities of immunosignatures. Here, we explore the 
use of structural equation models in order to try to 
determine whether the immunosignature formed by the 
fluorescent values of the 10,375 peptides is mostly ran- 
dom or if there is a consistent underlying pattern or fac- 
tor structure to an immunosignature that correlated 
with disease. This research question is made possible 
because of the novelty in immunosignaturing arrays that 
that allow a single antibody to bind to multiple peptides 
on the array. This research shows that there are com- 
plex and consistently reproducible structures underlying 
peptides which differentiate groups. Such patterns can 
be used as biosignatures for disease as well as provide 
deep insight into antibodies and immune response to 
disease. Although there are new analytic challenges in 
immunosignaturing, it is these exact challenges that pro- 
vide the promise of new discoveries while laying the 
groundwork for applications in future research and 
technologies. 

In this paper we present a range of statistical methods, 
their use and demonstrate what type of information they 
can provide researchers in immunosignaturing studies. 
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We show the ability to classify samples into their 
respective disease categories and find peptides which 
significantly predict disease status. This provides a pro- 
mising method for screening and potentially presympto- 
matic screening of disease. We also identify a number of 
latent factors using SEM. We hypothesize that the latent 
factors being modeled may represent specific antibodies 
that differ among disease classes. 

Methods 

Patient samples are analyzed by applying the sera or 
plasma to the array at a 1:500 dilution, detected with an 
anti-human fluorescent antibody, and the signals are 
read using an Agilent C laser scanner. Images are pro- 
cessed using GenePix Pro 8 providing a text file of 
values for each peptide. Binding affinity is a continuous 
value from 0-65,500 (16 bit image). Genepix software 
was used to convert the 16-bit TIFF images to values, 
median non-background subtracted values were used 
and log 10 transformation was done on the median nor- 
malized intensity values. Three distinct datasets were 
used in these analyses. One was a set of samples from a 
random group of individuals without breast cancer, a 
second set of samples is from a group with breast can- 
cer and finally the third set of samples is from a group 
of patients who were diagnosed with a second primary 
tumor. The normal samples were a convenient sample 
of individuals without any known breast cancer history. 
The breast cancer samples were a sample of current 
breast cancer patients with different levels of disease 
progression and diverse demographic backgrounds. 
There were 52 samples from normal individuals without 
cancer, 98 samples from cancer patients with a single 
primary tumor and there were 21 samples with second 
primary tumors. Human subjects protection was 
observed, collaborators ensured all samples were col- 
lected under the same protocol. All of the sample came 
from females between the age of 45 and 54. The specific 
ages for each sample was kept from us because of 
HIPPA and patient privacy concerns. All pre-processing 
was median-normalization per microarray slide, to 
adjust for global intensity bias. Data was also log 10 
transformed. The spot intensity was the median signal 
(obtained by GenePix Pro) with no local background 
subtraction. Background subtraction was not used 
because the arrays showed consistent background across 
the 1172 empty spots which were spread across the phy- 
sical surface of the array. Technical replicates also 
showed greater reproducibility without background sub- 
traction than with, indicating that the method for sub- 
tracting background was not useful. Additionally, the 
local and global background estimates were, on average, 
150-300 RFU, which for any microarray is extremely 
low considering the 3+ logs of dynamic range. 



It is common in similar lines of research, such as gen- 
otype experimentation to use a pattern matched experi- 
mental design. Matching participants in an experiment 
has the effect of increasing homogeneity among groups. 
As a result, the reduced within class variation which 
often accompanies matching designs has the effect of 
reducing the standard error and denominator of com- 
mon statistical tests. This in turn leads to higher statisti- 
cal power. Additionally, more homogeneous groups 
often enable easier classification in exploratory models. 
In the data analyzed here, the normal non-cancer sam- 
ples were not matched to either the cancer groups, how- 
ever research has shown that the signature of immune 
response is far less susceptible to the type of personal 
factors that genetic studies are - even HLA has only a 
minor effect on the consistency of a disease state immu- 
nosignature pattern [1,2]. 

Given that immunosignaturing is a new technology, 
early investigations, contrary to initial belief actually 
capitalize on the lack of rigid experimental designs. This 
is because additional sources of variance in the data 
allow us to better understand the robustness of the 
technology and related statistical analyses. If a method 
can perform well in a somewhat noisy environment with 
loose experimental designs, it is highly likely to perform 
even better when well curated studies (such as matched 
designs) are performed. In many respects, testing immu- 
nosignaturing data with loosely structured and curated 
data provides a much more stringent test of the technol- 
ogy and methods. Being able to obtain statistically sig- 
nificant results with the correct patterns of results from 
such unstructured data illustrates the versatility of 
immunosignaturing technology and the statistical meth- 
ods tested here. 

Understanding the robustness of the technology pro- 
vide guidance for future experiments using this technol- 
ogy while giving insight into the potential clinical use of 
immunosignaturing. Biologically, it is possible that 
healthy normal individuals with no active infection are 
responding immunologically to their environment, and 
persons with an infection have a focused immune 
response. It is likely that high variation in immune 
response to an environment would be present across 
individuals. Therefore, in order to be clinically useful, it 
is imperative that the technology and methods are 
robust enough to function accurately outside of pre- 
cisely controlled laboratory settings; as would be 
encountered during clinical deployment of the 
technology. 

Results 

Descriptive Statistics 

Table 1 provides basic descriptive fluorescence intensity 
statistics of each of the three disease groups. Descriptive 
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Table 1 Fluorescence intensity and descriptive statistics 



for the three disease groups 



Group 


Mean 


Minimum 


Maximum 


Std. 
Deviation 


Variance 


Range 


Normal 


329 


207 


9672 


93 


10336 


9465 


Single 
Tumor 


336 


204 


16702 


115 


16258 


16498 


Second 
Tumor 


676 


36 


49880 


549 


339301 


49844 



statistics of an immunosignature provide a significant 
amount of insight into the underlying immune response 
during disease states. Of particular biological interest in 
this sample is the difference in the range of values from 
the three groups. The normal and single tumor cancer 
samples have ostensibly the same floor value while the 
second primary tumor cancer samples have a much 
lower floor value. This may suggest a suppression of the 
immune system in second primary tumor cancer sam- 
ples. The single tumor cancer and second primary 
tumor cancer samples have progressively higher maxi- 
mum values which may suggest an increased immune 
response associated with cancer and a reoccurrence of 
cancer. 

Although there are large differences in the ranges, in 
order to have any predictive validity, the differences in 
ranges need to be consistent across samples within each 
group. For example, a high fluorescence value over 
45000 in the second tumor samples needs to occur on a 
given peptide with regularity to produce a statistically 
significant result. 

Classical Statistical Significance Tests 

There are a number of statistical tests which could 
potentially be used to test whether the differences 
between groups across peptides are significant beyond 
what would be expected by chance alone. Some of these 
methods include the T-Test, corrected T-Tests, Logistic 
Regression and Multinomial Logistic regression. The 
standard T-Test divides the mean difference between 
two groups by a standard error to produce a T-Statistic 
used for null hypothesis significance testing. One pro- 
blem with the standard T-Test is that normal theory 
underlying the test makes the assumptions that the var- 
iances in both groups are equal. The problem of 
unequal variances in a T-Test is commonly known in 
the statistics literature as the Behrens-Fisher problem 
and has been researched for the better part of the last 
century in various contexts. If the assumption of equal 
variances is violated, the T-Statistic can be either 
inflated or deflated depending on the samples sizes in 
each group. As a result, the analyses were conducted 
using a Satterthwaite corrected T-Test. The 



Satterthwaite test is one of numerous corrections for 
unequal variances that have been proposed over the 
years. The Satterthwaite test works by adjusting the 
degrees of freedom in the test. The resulting correction 
produces an asymptotically correct T-statistic when 
groups have unequal variances. The Satterthwaite cor- 
rection works by modifying the degrees of freedom via 
equation 1: 

{wl + w2) * 2 
* = wl*2 w2*2 (1) 
n2 - 1 n2 - 1 

A Satterthwaite corrected T-Test and a number of 
similar test corrections which could have also been used 
such as a Brown-Forsythe correction in an ANOVA 
model tended to produce statistically significant results 
after a Bonferroni correction for multiple testing (alpha 
inflation). A Bonferroni correction was used to protect 
against alpha inflation because with a standard alpha 
level of .05, purely by chance alone, 1 out of 20 tests 
will be significant. The Bonferroni correction divides the 
alpha value by the number of tests run; in this case 
10,375, or one for each peptide on the microarray. This 
resulted in a corrected p-value threshold of 4.819*10" 6 . 
Nonetheless, despite this much lower p-value, highly sig- 
nificant results are still obtained for Satterthwaite cor- 
rected T-Tests comparing normal versus single tumor 
cancer samples, normal versus second diagnosis samples 
and single tumor cancer versus second primary tumor 
cancer samples. Table 2 shows the top 10 significant 
peptides for a Satterthwaite corrected T-Test comparing 
normal samples to cancer samples. Logistic and Multi- 
nomial logistic regression may also be of interest and an 
alternative method for comparing groups to the tests 
used here. One place in which logit models may be use- 
ful is if a researcher in future studies has a known set of 
covariates they wish to control for. For example, in the 
study of diabetes, it may be of interest to control for 
body mass index or HB1AC test results. 

Exploratory Factor Analysis 

Factor analytic models have previously been used in 
analyzing immunosignatures and are quite common in 
analyzing high dimensionality microarray data [2,5,6]. 
Each of the models explored during this line of research 
were investigated in order to determine its feasibility for 
answering a specific research question. Exploratory fac- 
tor analysis (EFA) was examined as a method to be able 
to differentiate samples based on disease states with no 
prior clinical knowledge of the samples. Estimation of 
EFA models was performed using ordinary least squares 
(OLS). EFA with Promax rotation proved significantly 
better than chance at classifying samples. EFA is a set of 
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Table 2 Top 10 significant peptides for a Satterthwaite corrected T-Test comparing normal samples to cancer samples 



Variable ID 


Peptide Sequence 


T-Value 


Degrees Of 
Freedom 


P-Value 


V2833 


i irni/\A/i n/nm a /i/i n \\/\/\ a i\/r~cr~ 

H V KKW H KKKW KH H KKW KGSC 


-6.51 


1 32.4 


1 .4372L-09 


V31 13 


i mn/\A/i i\a/i/i inn n in mrrrr 

H Kh KWH WKH Kh H H V H KhGSC 


-6.29 


144.41 


3.5843L-09 


vo/ / z 


n\<f\(\-\nnr-,^f\<\ p\a/i ^kac^c 




1 AA RA 




V9732 


WRRSTPVGPVmVFGKFLGSC 


-6.12 


146.1 


8.1933E-09 


V7196 


RFGRPQHQHDFRRHAIYGSC 


-6.06 


146.8 


1.1046E-08 


V6978 


QSHMTLAPGIRRYKKFNGSC 


-6.06 


146.32 


1.1046E-08 


V7387 


RMGFGLYERLWG^NHYGSC 


-6.01 


134.26 


1 .6532E-08 


V9561 


WKWKRHWKWPHRRKHFFGSC 


-5.95 


144.49 


1.9475E-08 


V6987 


QSIGLGYSAFMPKWPFRGSC 


-5.93 


140.13 


2.2543E-08 


V3249 


HWKRHHRPKHKHHRHKHGSC 


-5.9 


145.4 


2.4586E-08 



procedures that accounts for the relationship among a 
set of variables in terms of a smaller set of underlying 
latent constructs or factors. (For example, a factor is a 
disease state.) We specifically use principal axis factoring 
with iterated communalities. Although PCA and EFA 
are quite similar, an important difference between the 
two methods is that PCA makes the assumption that all 
of the variance in an item is a reflection of common 
variance shared among all items whereas EFA posits 
that each item shares some common variance with all 
other items but also has its own unique variance. Math- 
ematically the difference between PCA and EFA is the 
addition of single matrix; D 2 . 

Rzz = A * Rf * A' + D 2 (2) 

In equation 2 R zz is the correlation matrix among the 
observed variables. A is a matrix of factor loadings, R f is 
the correlation matrix among the factor loadings, the A' 
denotes the transpose of the A matrix of factor loadings 
and thus AR f A' is the matrix representation of the com- 
mon factor structure. D 2 is a diagonal matrix that cap- 
tures the unique variance weights and distinguishes EFA 
from PCA. 

Varimax and Promax rotation methods were 
explored in depth. This is in part because Varimax is 
often a starting point for a Promax rotation. A sample 
is said to "load on" a given factor when the model sug- 
gests a strong fit on the given factor. Rotation in EFA 
is a method for making factor loadings more interpre- 
table. Rotation methods change the relationship 
between items and the factors (which are geometrically 
represented as axes). Rotation does not change the 
relationship among the individual items. Since rotation 
methods only make changes to the axes and not to the 
communalities (variance accounted for), rotation does 
not mathematically change the initially obtained 
results. Rotation makes the factor loadings more 
interpretable. 



Varimax uses a complexity function to maximize the 
variance of the squared loadings on each factor. This 
results in loadings with a more even spread across the 
factors; as opposed to having an overabundance of load- 
ings on a first factor. Varimax is an orthogonal rotation 
that maintains the orthogonal (90 degrees) intersection 
of the axes. This has the result of keeping the correla- 
tion between the factors at zero because the cosine of 
90 degrees is 0. 

Promax is an oblique rotation that allows the angle 
between the axes to vary. In statistics, variance has to be 
accounted for in some part of the model. Allowing the 
axes to vary and thus a correlation between the factors 
is another path to account for variance. Allowing var- 
iance to be expressed in terms of correlations between 
factors has the result of not forcing variance between 
factors to be represented as between item variance. This 
can result in cleaner factor loadings. Additionally, the 
assumption that there is no correlation between factors, 
or in this analysis, disease states, is unlikely because 
there will always be some additional common variance 
and similarities in immune samples due to basic 
immune responses and structures present across all 
samples. 

Unlike Varimax, Promax does not use a complexity 
function. Rather, Promax rotation is a procrustean rota- 
tion to a target matrix. In Promax, a pattern matrix of 
loadings (often derived from Varimax rotation) is taken 
to some power (i.e., squared, cubed etc.) to form a tar- 
get matrix. The original loading components are then 
rotated to get as close as possible to the newly formed 
target matrix. A number of EFA models with Promax 
rotation were run to investigate the utility of this 
method for differentiating between groups with no prior 
knowledge of group membership. Table 3 provides sum- 
mary results. The number of factors was known to be 2 
for each comparison. Scree plots were used to validate 
the hypothesis. None of the plots suggested the presence 
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Table 3 Exploratory Factor Analysis Results 



EFA Model 


Correct 




Classification 


Single Primary Tumor and Second Primary Tumor 


93.45% 


Samples 




Non-Cancerous and Second Time Cancer Samples 


84.4% 


Non-Cancerous and First Time Cancer Samples 


68% 



of a strong third factor. A scree plot plots the eigenva- 
lues for each component. The largest components 
before a leveling off is used to determine the appropriate 
number of factors. Factor loadings greater than .3 were 
said to load on a given factor. If loadings for both fac- 
tors were less than .3 the sample was said to not 
counted as a correct classification on either match. 
Catell (1966) provides a more detailed description of 
how to use eigenvalues and scree plots for determining 
the number of factors [7]. 

An EFA between cancer samples and the samples from 
patients who had a second primary tumor produced a 
correct classification for 93.45% of the cases. Of the cases 
that were miscategorized, all of them except one were 
cancer cases that loaded more highly with the second pri- 
mary tumor group. There are a few possible explanations 
for this. This could simply be model error resulting from 
the lack of homogeneity among the first time cancer 
group. However, it is possible that the miscategorized 
cases may represent individuals who will at some point in 
the future develop a second primary tumor or are unbe- 
knownst to the researchers already in the process of 
developing one. All this says is that less than 10% of can- 
cer samples are more closely related to the samples of 
individuals who had acquired a second primary tumor 
than the samples with a single primary tumor. 

A second EFA was run between the normal or non- 
cancerous samples and the samples with a second pri- 
mary tumor. The overall classification accuracy was 
84.8%. Within this model, 74.1% of the normal samples 
loaded correctly on the same factor whereas 100% of 
the second primary tumor samples loaded on the cor- 
rect and same factor. 

A third EFA was run exploring the relationship 
between normal or non-cancerous samples and single 
tumor cancer samples. Using the same model specifica- 
tions as in the first model, this EFA produced a 68% 
classification accuracy. Although this is quite low by tra- 
ditional model building standards, there are a number 
of factors relating to the data which may make this a 
useful starting point. First, the normal patients were 
taken from a wide range of convenient lab samples. 
Some of the normal samples may have come from indi- 
viduals outside of the age and traditional demographic 
background to even be remotely at risk for breast 



cancer. Secondly, the stage and progression of cancer 
patients was unknown. As a result, an additional possi- 
bility for the classification accuracy may be that the 
cross loadings represent a mixture of early stage cancer 
patients and those at high risk for or who are develop- 
ing cancer. 

Unfortunately, detailed information about the disease 
state of the samples is unavailable and does make con- 
jectures purely hypothetical. However, in all models, the 
results are significantly better than chance and illustrate 
in many ways the performance of the technology and 
approach under adverse conditions. The three models 
taken in concert illustrate that the lack of a concrete 
and well curated control group is likely responsible for 
the decremented classification accuracy in some models. 
This can be most clearly seen when considering that the 
single tumor cancer and second primary tumor cancer 
samples consistently exhibit stable factor loadings with 
relatively low cross loadings because the single tumor 
cancer samples serve as a much cleaner control group 
for the second primary tumor cancer samples than the 
normal do for either of the cancer groups. This early 
research suggests that future studies using more pre- 
cisely selected control groups and experimental design 
would have even better ability to classify cancer patents. 

Beyond classification accuracy, the similarity between 
different factor based models and rotations is extremely 
informative from a biological perspective. All combina- 
tions of PCA and EFA with Varimax or Promax gave 
highly similar results with respect to overall classifica- 
tion of groups across a number of different analyses. 
Although specific factor loadings certainly had different 
values, the overall picture and classification accuracy 
was relatively constant. Brief investigations into other 
rotations such as Oblimin were also explored in the 
context of EFA models and produced similar results to 
Varimax and Promax. 

First, with respect to PCA versus EFA, the lack of differ- 
ence suggests that the vast majority of the variance 
accounting for classification is at the factor level (ie. osten- 
sibly disease state) and not the individual level. This is 
because as the D 2 matrix which differentiates the two 
methods captures the unique variance in an EFA model 
and as the D 2 matrix approaches zero, an EFA model 
approaches a PCA model. Therefore, since the D 2 matrix 
is the only difference in the equation and an analytic solu- 
tions exists due to Ordinary Least Squares estimation, we 
can conclude that the lack of difference was because there 
was relatively little unique variance present. 

Confirmatory Factor Analysis 

Since EFA models showed the ability to differentiate 
samples, a logical clinical application of immunosigna- 
turing would be to screen a single sample from an 
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individual to determine his or her disease status. Confir- 
matory factor analysis (CFA) was chosen as an ideal 
method for investigating this question due in part to the 
similarity with EFA and because of the versatility to 
examine one specific sample in detail. EFA is an 
exploratory method that should be used when the num- 
ber of groups or structure of the data is not well under- 
stood. Conversely, CFA is a confirmatory method that 
can be used when the structure of the data is well 
understood. As the name implies, exploratory factor 
analysis, EFA models should not be used as confirma- 
tory model or to confirm a hypothesis. 

Both CFA and EFA attempt to explain the underlying 
structure in a dataset. However, CFA and EFA approach 
the problem from two distinct directions. EFA makes 
almost no prior assumptions about the structure of the 
data and attempts to sort through the data to help a 
researcher determine what the underlying structure of 
the data is. In this research, the general group member- 
ship was known and thus the appropriate number of 
factors was specified apriori. In a CFA model, the 
researcher explicitly identifies not only the number of 
factors but which cases load on each factor as well as 
factor variances, covariances between the factors and 
disturbances for each item. CFA models are not data 
mining approaches and require well formulated notions 
about the underlying structure of the data. 

Mathematically, the simplest formulation of a CFA 
model in matrix notation is: 



X = A * f * A L 



(3) 



In Equation 3, x is a vector of observed variables, A is 
a matrix of factor loadings, £ is a matrix of scores for 
each variable on a factor or latent construct and A is a 
vector containing measurement error. 

In the CFA models analyzed here, one sample from 
each factor (disease state) was chosen at random as a 
scaling constraint in order to ensure identification in 
these models. Maximum likelihood estimation with 
robust standard errors was used to estimate these CFA 
models. The known disease status was the basis for 
defining the factor loading for each sample. A sample 
was allowed to load only on a single factor and fixed to 
zero on the other. Variances and covariances between 



Table 4 Confirmatory Factor Analysis Results 



CFA Model 


Correct 




Classification 


Single Primary Tumor and Second Primary Tumor 


89.9% 


Samples 




Non-Cancerous and Second Time Cancer Samples 


93.1% 


Non-Cancerous and First Time Cancer Samples 


83.4% 



all factors were estimated. Summary results are provided 
in Table 4. 

For a CFA comparing single tumor cancer samples and 
second primary tumor samples, 89.9% of samples loaded 
on the specified factor. For a normal versus second pri- 
mary tumor CFA, 93.1% of the samples loaded on the 
specified factor and a normal versus single tumor CFA 
produced sample loadings on the specified factor 83.4% 
of the time. The difference in classification accuracy 
between the CFA and EFA models is due to a number of 
factors; some of which include model variance and covar- 
iance specifications as well as different estimator types. 

One primary advantage CFA models have over EFA 
models are fit indices which give some quantitative mea- 
sure of how accurately the specified model is. Although 
there are a plethora of fit indices that have been proposed 
within the structural modeling framework that CFA 
models reside, the chi-square difference test, root mean 
square error (RMSEA) and standardized root mean error 
(SRMR) are among the most common and widely cited. 

The chi-square test ostensibly tests how well the speci- 
fied model reproduces the covariance matrix from the 
original data. The problem with this test is that it is so 
sensitive that it is nearly impossible to obtain statistically 
non-significant results. It is important to note that the 
null hypothesis of this test is that there is no difference 
between the specified model's covariance matrix and the 
covariance model in the actual data, a non-significant p- 
value is the desired outcome. Because it is of interest to 
find no difference between the specified model and the 
data, a non-significant p-value is the goal. The chi-square 
test for all of the CFA models was significant with p < 
.001 suggesting that there is a statistically significant dif- 
ference between the specified model covariance matrix 
and the covariance matrix of the original data. However, 
the chi-square test is extremely sensitive and often 
detects trivial differences [8,9]. Noting the sensitivity of 
the test is not meant to suggest that in fact the specified 
CFA models are perfect fits or deny lack of fit. Rather, 
the test is noted because it is among the most common 
fit indices and the issues with the test are noted as a 
means of providing appropriate context for the results. 

The Root Mean Square Error of Approximation 
(RMSEA) and the Standardized Root Mean Square Resi- 
dual (SRMR) are two common fit indices used in the 
Structural Equation Modeling (SEM) framework 
description; of which CFA is a part of. The basis of the 
RMSEA is a non-centrality parameter. The simplest 
reduced form of the RMSEA equation is: 



RMSEA : 



XT \ 

?)-' 

\| n- 1 



(4) 
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In equation 4, X 2 is the model generated chi-square 
value, df is the degrees of freedom and n is the sample 
size. Smaller RMSEA values suggest better fit. The 
SRMR measures the standardized difference between 
the observed covariance matrix and the model implied 
covariance matrix. 

For the CFA model for single tumor samples versus 
second primary tumor samples, the RMESA was .083 
and the SRMR was .071. For the CFA model comparing 
normal versus second primary tumor samples the 
RMSEA was .097 and the SRMR was .076 while the nor- 
mal versus single tumor samples produced a RMSEA of 
.07 and a SRMR of .074. These are marginally significant 
results because traditional benchmarks cite .05 as a cut- 
off for statistical significance [8]. RMSEA and SRMR 
values in the .05-.08 range are usually regarded as mar- 
ginally significant. Although the results do not meet the 
rigid .05 level, they are actually quite impressive when 
considering the experimental design and the fact that a 
portion of the lack of fit may actually be representing 
natural biological patterns such as the development of a 
first or second tumor. 

Perhaps the real utility of a CFA model for immuno- 
signaturing could come in the form of diagnostic testing. 
Given the accuracy of the CFA model with this data, 
once a well curated set of samples for a certain disease 
or collection of diseases has been established, a CFA 
model could be specified where a new unknown sample 
could be allowed to load on both (or multiple) factors. 
By comparing the relative loadings on the factors, it 
would be possible to determine to which group the sam- 
ple most likely belongs. For example, there are numer- 
ous subtypes of breast cancer and different stages of 
disease progression. If a collection of samples was avail- 
able as a concrete reference set, a CFA model could be 
easily and accurately employed as a new method for aid- 
ing in the diagnosis as well as perhaps early detection of 
breast cancer. 

Structural Equation Models 

From EFA, CFA and descriptive statistics we know that 
the immunosignatures as a whole are in fact different 
across groups while corrected T-Tests show that there 
are statistically significant systematic variations. The 
logical question arising from these findings is how pre- 
cisely do the immunosignatures differ from one another? 
Is there a clear, consistent and reproducible pattern 
underlying the differences in immunosignatures across 
disease states? Because a single antibody can bind to 
multiple peptides and different antibodies can bind to 
the same peptide, a coherent pattern of peptide fluores- 
cence across an immunosignature is much more infor- 
mative than the fluorescence of individual peptides on 
their own. Furthermore, being able to identify common 



relationships and covariances between groups of pep- 
tides is of even greater utility. This can be accomplished 
by modeling latent factors. 

On a genotype microarray, the probe is directly mea- 
suring an individual's genotype at a specific location. In 
contrast, the peptide probes on an immunosignature 
array are indirectly measuring immune response and 
antibodies present in the sera. When measures are not 
directly observed they are often referred to in statistical 
and structural equation modeling literature as latent fac- 
tors. If there are clear, consistent and reproducible pat- 
terns caused by specific antibodies in a sera sample 
binding to peptides on an immunosignaturing array, it 
should be possible to model individual antibodies as 
latent factors. For example, when reading the tick marks 
on a mercury thermometer, one is not reading a direct 
measure of temperature but rather displacement of mer- 
cury. The latent factor measured by displacement of 
mercury is temperature because from a purely physics 
standpoint, temperature is the kinetic energy of an 
object; usually measured at the molecular level. Another 
example of a latent factor is depression. Psychologists 
cannot directly measure depression but they can ask a 
series of questions that cumulatively allow them to 
model the latent construct of depression. Each question 
in a depression inventory gets at one small piece of the 
latent factor depression in much the same way that pep- 
tides on an immunosignaturing array provide an indirect 
measure of immune response; as measured primarily by 
IgG antibodies. 

Structural equation modeling (SEM) is specifically 
designed for modeling latent variables. SEM models 
have two parts: a path model comprised of regressing a 
set of variables on another and a measurement model in 
which CFA is used to form latent variables. When a set 
of measured variables is set to load on a given factor, 
the result is a latent factor. In SEM, the resulting latent 
variables can be treated as either endogenous or exogen- 
ous variables; depending on the research question of 
interest. A full SEM model is a collection of equations 
defining each variable and their relation to one another. 
Since complex models can quickly generate a large 
number of equations, SEM models are often represented 
graphically for quicker interpretation. Since confirma- 
tory factor analysis is a major component in a full latent 
variable structural equation model, attempting to classify 
samples with factor analytic methods lent evidence to 
the feasibility of SEM models. These early models also 
provided a plethora of background information which 
aided in the testing of full SEM models. 

Initial SEM Testing 

Despite evidence from previous factor analytic models 
that SEM models should be feasible, since these are 
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highly complex models, an incremental approach was 
taken to building and testing large scale SEM models. 
To start with, a measurement model and full structural 
equation model was run using the top three peptides 
from the normal versus single tumor cancer samples 
(Table 2) to predict disease. The measurement model 
(ostensibly a confirmatory factor analysis) in a SEM 
model tests the loadings of individual peptides onto 
latent variables. In this model one peptide was set as a 
scaling constraint and the other two were freely esti- 
mated. Three peptides were chosen because that is the 
minimum needed for model identification and provides 
for the simplest model. Because of the iterative nature 
of the maximum likelihood algorithms used in SEM 
models, starting with a simple model reduces computa- 
tional time and aids in convergence. Furthermore, start- 
ing with the simplest model and building up is good 
practice in modeling. 

Since a measurement model with 3 factors is just 
identified or has no extra degrees of freedom, fit indices 
cannot be calculated. However, all the variables load 
strongly on the latent factor with loadings greater than 
.7. This finding suggests that the top 3 peptides are indi- 
cative of a single underlying latent factor. 

In order to help rule out the possibility that the con- 
sistent loadings in the first model were not type 1 error 
or false positive, the same model specification was run 
in an attempt to see if the top 3 peptides differentiating 
single tumor cancer samples from second primary 
tumor cancer samples. In this model the top 3 peptides 
also loaded on a single latent variable. Like the first 
model, the second model illustrated the same pattern of 
results with the top 3 peptides all significantly loading 
on a single latent factor. 

The same pattern of results can be replicated with two 
disease contrasts. Replicating the finding with normal 
versus a single primary tumor cancer and second pri- 
mary tumor cancer versus single primary tumor drama- 
tically reduces threats to validity against causal 
conclusions proposed by SEM models of immunosigna- 
turing data. 

When investigating models that differentiate two dis- 
tinct groups from a baseline group (in this case single 
tumor cancer samples) there are three potential out- 
comes. First, a complete lack of model fit and no consis- 
tent underlying factor structure. In this case, none of the 
peptides would load consistently and correctly on either 
of two specified factors suggesting that peptide flores- 
cence is random. The second possibility is that all of the 
peptides would load on one factor. This result could 
result from any number of potential biases in the tech- 
nology itself, printing or processing of the microarrays. 
Another reason all of the peptides might load on one 
common factor is that they are all part of a single latent 



factor. However, because the significance of each pep- 
tide varies quite precipitously across group contrasts, it 
seems unlikely that a single underlying latent factor 
would produce different significance values across dis- 
ease contrasts. The third possibility is that the peptides 
significantly load on two separate factors and that the 
peptides for each contrast exhibit no cross loadings. 

A series of analyses was run using significant peptides 
from normal versus single tumor cancer corrected T- 
Tests as well as second primary tumor samples versus 
single tumor samples combined into a single model. 
The first model was a measurement model which added 
the first two CFA's into one model. The top 3 peptides 
for normal versus single tumor samples and single 
tumor samples versus second tumor samples each were 
set to load on a separate latent factor. A covariance 
between the two latent variables was also estimated. The 
path diagram in Figure 1 illustrates this model. In Figure 
l's path diagram, the square boxes represent measured 
variables, which, in this case are peptide fluorescent 
values. The large circles are the unmeasured latent vari- 
ables. The arrows between the latent factors and mea- 
sured variables show which measured peptides load on 
which latent variable. The curved arrow represents an 
estimated covariance between the two latent variables. 

In path diagrams, the arrows represent the causal flow 
of information. The arrows are pointing from the latent 
variables to the measured variables because the argu- 
ment in SEM models is that there is some unmeasured 
and underlying latent construct that is responsible for 
the observed results of the measured variables. The 
immune response and antibodies present in the sera 
samples is the ultimate causal factor of peptide 
fluorescence. 

The model tested in Figure 1 was estimated using 
maximum likelihood estimation with robust standard 
errors (MLR). The model exhibits excellent model fit 
with an RMSEA of .063 and an SRMR of .031. In 




V2833 V3113 ■ V6772 ■ ■ V4356 V10218 H V7869 



Figure 1 Latent & measured variables in immunosignaturing: 

See Table 2 for peptide sequences. 

I J 
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addition, the Chi- Square test was not significant, Chi-Sq 
= 14.054, df = 8, p = .0804. A non-significant Chi- 
Square test is the desired result. Again, this is because 
the null hypothesis of this Chi-Square test is that there 
is no difference between the observed covariance matrix 
(input data) and the covariance matrix implied by the 
model in Figure 1. These results strongly suggest excel- 
lent model fit and that the latent factors are unique con- 
structs. Biologically, this suggests that a different latent 
factor is underlying each latent variable. 

To further confirm the interpretation that the latent 
factors are different, one peptide from each factor was 
switched. V3113 and V10218 were set to load on the 
opposite factor from the first model. In this new model, 
there was a complete lack of fit. In addition to poor 
loadings, the fit indices dramatically decreased. The 
RMSEA was .354, the SRMR was .192 and the Chi- 
Square was 198.704, df = 8, p < .0001. Thus further sug- 
gests two different underlying constructs rather than 
statistical anomalies. 

An additional set of analyses were run using the top 5 
peptides instead of just the top 3. The first models run 
in this sequence were Varimax and Promax exploratory 
factor analyses. Both models gave 100% classification 
with extremely strong loadings on each factor. Table 5 
is the rotated factor pattern or a two group EFA taking 
the top 5 peptides from each disease contrast. This 
clearly illustrates the top five peptides strongly load on 
factor one while the last five strongly load on the second 
factor. The loadings of peptides are consistent with the 
groups from which each peptide was selected. For exam- 
ple, v4356, vl0218, v7869, v8672 and v8170 were the 
top 5 most significant peptides differentiating first time 
cancer samples from second time cancer samples. In 
combination with earlier results, this very clear and con- 
sistent loading pattern strongly suggests that the top 
peptides for each class form unique latent variables and 
they are almost irrefutably measuring different 



Table 5 Rotated factor pattern for two group EFA of 
significant peptides 



Peptide ID 


Peptide Sequence 


Factorl 


Factor2 


V4356 


KYQFAGQRSGKQYRWRIGSC 


0.88773 


0.05624 


V10218 


YQPPPRKAVIQMDWLSYGSC 


0.92126 


0.06844 


V7869 


SKFRDVLTFNEPSRFVSGSC 


0.51657 


0.04716 


V8672 


WHESMIYRMRFMTFKHGSC 


0.93261 


0.04783 


V8170 


SWRRMRMHKNFMISNLDGSC 


0.87997 


0.06368 


V2833 


HFRKWH KRRWKH H KKWKGSC 


0.11128 


0.7436 


V3113 


HRFKWHWKHRFHHFHRFGSC 


0.06271 


0.82673 


V6772 


QKFKHQQGSFKLPWLSMGSC 


0.12145 


0.73203 


V9732 


WRRSTPVGPVmVFGKFLGSC 


0.05844 


0.88795 


V7196 


RFGRPQHQHDFRRHAIYGSC 


0.035 


0.88098 



constructs. Biologically, this suggests that the latent fac- 
tor which is more active in single tumor cancer samples 
compared to normal samples is not the same latent fac- 
tor that appears to be present in second tumor samples. 

The same result was also found by running a two 
group exploratory factor mixture model with Geomin 
rotation. Geomin rotation is another oblique rotation 
method similar to Promax. A more complete discussion 
of the mathematical differences of rotation methods can 
be found in Browne (2001) [10]. In this data, the 
observed peptides as a whole form a single distribution. 
In mixture modeling, the underlying notion is that the 
distribution formed by all of the observed data is the 
product of two or more underlying distributions; each 
of which represents a distinct class. Ostensibly, an 
exploratory factor mixture model is trying to answer the 
same question as PCA and EFA, PAF/Factor Analysis 
but via a different mathematical framework. Despite the 
complexity of mixture modeling, the basis of an explora- 
tory factor mixture model is for a categorical latent class 
variable C, for a specific class k. The model estimated is: 

Y p = V kp + X kp * T) * (E p (5) 

In equation 5, for a variable Y p , V kp is an intercept 
parameter, X\ <p is a vector of loadings, r| is a vector of 
latent factors and s p is a residual term. In addition, 
there is a correlation matrix *F k for the latent factors rj 
of class k along with a distribution for the latent class 
variable C: P k = P(C = K). In this equation, for a depen- 
dent variable P, the probability of C is equal to k. Also, 
other constraints are added to this basic framework for 
purposes of identification but are related to model spe- 
cific decisions such as orthogonal or oblique rotation. 

EFA mixture models were estimated using Maximum 
Likelihood with Robust Standard Errors (MLR) estima- 
tion and 20 random start values. Random starting values 
were used in part due to the complexity inherent in 
mixture models and to check for local solutions. By run- 
ning the analysis with multiple random start values log 
likelihood (LL) values can be compared. To the extent 
that different LL values are obtained, the random start 
values can be directly input into the model and the 
results can be compared to the best fitting LL model. 
This is useful because if different start values produce 
dramatically different results, this might suggest that the 
algorithm converged at a local maxima instead of global 
maxima or that the results are unstable. 

Fit statistics such as the Bayesian Information Criter- 
ion (BIC) provide a more quantitative analysis of model 
fit for a series of nested models. EFA mixture models 
were estimated for one, two and three class models. 
This approach allows us to confirm that a two class 
model is in fact the best fit for the data. 
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The series of EFA mixture models suggested the same 
pattern of results as traditional EFA models; that there 
are two distinct and separate underlying classes formed 
by the top 5 peptides for each disease contrast. In addi- 
tion, mixture models also produce a statistic for the 
average latent class probability: 

P (Y p = j|C = K) = cp - 1 (T * kpj) - cp - 1 (T* kpj - 1) (6) 

In equation 6 T* kp j is a threshold parameter on a stan- 
dardized correlation metric and 0 is a matrix of resi- 
duals for the latent factors [11]. For both two and three 
class models, the average latent class probability for the 
most likely latent class membership was greater than 
99% for both class 1 and class 2. In other words, for the 
subgroup of samples classified as being part of class 1 
by the model, more than 99% of the time, class 1 was 
also their most likely class membership. This further 
reaffirms the excellent model classification. The three 
class model produced nearly identical average latent 
class probability values because the model did not clas- 
sify any of the peptides as belonging to the third class. 

The BIC was used to assess the best fitting model. 
The BIC is estimated as follows: 

BIC = -2 * LL + p * log (n) (7) 

In equation 7 LL is the log likelihood value of the 
model, p is the number of parameters and n is the num- 
ber of observations. The lower the value of the BIC the 
better the model fit. Often times, BIC values or plots 
are used ostensibly in the same fashion that scree plots 
and eigenvalues are used in PCA or traditional factor 
models where a researcher looks for the point at which 
the decrease in values levels off. However, in this analy- 
sis, the two class model had the lowest BIC and some- 
what unexpectedly, the three class model actually saw a 
slight increase in the BIC This result further reaffirms 
the excellent fit of a two class model. 

As is common in model building, a series of full struc- 
tural equation models (SEM) were run in increasing 
levels of complexity. To start with, the two latent vari- 
ables were regressed on their respective disease states in 
individual models. A path diagram for the normal versus 
single tumor samples is presented in Figure 2. 

These models were estimated using MLR. The latent 
variable regression was performed using logistic regres- 
sion and was significant, p < .001. Additionally, the odds 
ratio was 1.841. This suggests that having the attributed 
measured by the latent variable makes an individual 
1.841 times more likely to develop breast cancer. The 
same model specification for single tumor versus second 
tumor samples produced similar results with p < .001 
and an odds ratio of 3.49. In other words, there appears 
to be a latent factor that is present in those who have a 
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Figure 2 Path diagram for normal versus cancer peptides SEM 
model. 

{ J 

single tumor that is not present in those samples with a 
second primary tumor. 

Furthermore, another SEM model was run combining 
the above two analyses so that the two distinct latent 
variables were used to predict disease status. The esti- 
mation of disease status was done via multinomial logis- 
tic regression. This was done because when the models 
were combined there were three levels of disease. In a 
multinomial logistic regression model, one level (in this 
case single tumor samples) was set as the reference 
group. Then n-1 logistic separate regression equations 
are run; where n is the number of levels of the depen- 
dent variable. Therefore, since each latent variable was 
regressed on disease status, there were two logistic 
regression equations run. Both latent variables predicted 
their respective disease status with p < .01. Again, this 
suggests that normal, single tumor cancer and second 
tumor cancer samples are separated by different sets of 
latent variables. 

The first set of SEM models provided an initial proof 
of concept for full SEM models. This laid the ground- 
work for the more interesting question of what the 
underlying structure looks like for unique parts of the 
immunosignatures. Since further investigations are 
meant to look at the overall differences in immuno- 
signatures as a whole, it is hypothesized that the latent 
factors differentiating groups are specific antibodies pre- 
sent in the sera samples; as explained above. Two 
experimental tests were conducted: a series of structural 
equation models and an examination of the peptide 
means across groups. 

SEM Models of Significant Peptides and Antibodies 

Next, all of the peptides that were statistically significant 
after a Bonferroni correction in the normal versus single 
tumor and second tumor versus single tumor contrasts 
were selected for further analysis. Following the same 
pattern as before, exploratory factor analysis models 
were run to determine how many underlying factors 
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appeared to be present. This was done because selecting 
the top peptides might yield more than one factor; sug- 
gesting more than one antibody. For the normal versus 
single tumor contrast there were 176 peptides that were 
significant and there were 30 significant peptides for the 
second tumor versus single tumor contrast. The eigen- 
values and scree plots suggest a three factor solution for 
the normal versus single tumor contrast and a one fac- 
tor solution for the second tumor versus single tumor 
contrast. A scree plot for the second tumor versus single 
tumor contrast is shown in Figure 3. In other words, for 
the normal versus single tumor, the hypothesis is that 
there are three antibodies that differentiate the groups 
while there is only a single antibody differentiating the 
second tumor versus single tumor groups. 

In the second tumor versus single tumor contrast, fac- 
tor loadings from exploratory factor mixture models and 
Promax EFA models confirm an unstable second factor. 
This is because the loadings on the second factor are 
generally low and minimally larger than the first factor 
loading on the same peptide. Additionally, a two factor 
solution produced Heywood cases in which there were 
communality estimates greater than one; suggesting a 
problem with the two factor model. When single factor 
models were run, all of the peptides loaded highly on 
the one factor. As a result of the EFA models suggesting 
a single factor solution, a full SEM model was run in 
which all of the top 30 peptides were set to load on a 
single latent variable which was then regressed on dis- 
ease status. In this model, the stable latent factor signifi- 
cantly correlated with disease status, p < 0.001. The 
odds ratio of 3.148 suggests that the single hypothesized 
antibody confers significant risk for acquiring a second 
tumor. Also, the means for all of the peptides in the sec- 
ond tumor samples were lower than the means for the 



Scree Plot for Single Tumor versus Second Primary Tumor 



Figure 3 Scree Plot for Single Tumor versus Second Primary 
Tumor. 



single tumor samples. This suggests immune suppres- 
sion. In other words, there appears to be an antibody 
present in samples with a single tumor that is not pre- 
sent in samples with a second tumor. 

The normal versus single tumor samples is a bit more 
complex. A full SEM model containing all three 
hypothesized factors was unable to be estimated because 
there were more peptides than samples. Therefore, there 
were not enough degrees of freedom to run a full model 
containing all 3 groups. As a result, subsets and indivi- 
dual factors were tested individually. When tested indi- 
vidually, all of the three factors/hypothesized antibodies 
significantly correlate with disease, p < 0.01. Two of the 
latent factors positively correlated while the third nega- 
tively correlated with disease status. 

Within the 176 significant peptides for normal versus 
single tumor samples, 162 peptides increase or have a 
higher mean in the cancer samples than in the normal 
samples increase. Conversely, 14 decrease or have a 
higher mean in the normal group than in the cancer 
group. In other words, there appears to be two new 
antibodies present in cancer samples not present in nor- 
mal samples and one antibody present in normal sam- 
ples that is not present in cancer samples. 
Immunosignatures are unique in analysis of the humoral 
response in that they can detect decreases in reactivity 
relative to normal levels. 

One finding of particular note is a high covariance 
between the two positive factors (or proposed antibodies 
present in cancer that are not in normal samples). The 
high covariance and multicollinearity suggests that the 
two are very similar. When regressing both of the posi- 
tive latent variables on disease, in every instance, only 
one of the latent factors was significant with p < .05. 
This is likely due to the way in which multiple regres- 
sion partitions variance. In a multivariate regression 
model, the effect of one variable (x) is the unique con- 
tribution of that variable with all others held constant. 
Because there is so much common or shared variance, a 
vast majority of the variance is used up or accounted for 
by the first factor, not leaving enough unique unex- 
plained variance left for the second factor to be signifi- 
cant as well. 

A two level measurement model was run to test to see 
if the two factors were measuring a similar underlying 
construct. In this model, the two latent factors were set 
to load on a third latent variable. The theory behind 
this test was that if the two latent factors loaded on a 
single second level latent factor then the two original 
factors would be measuring the same underlying con- 
struct. One way this could occur is if the antibody had a 
highly complex structure. However, this model was not 
significant, RMSEA = .21, SRMR = .09. This suggests 
that the two factors are unique albeit highly similar. 
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There are a number of potential interpretations of this 
result. One of the more plausible biological hypotheses 
is the presence of subpopulations. Among two different 
cancer subtypes of single tumor breast cancer, there are 
likely two distinct antibodies; one for each subtype. If 
subpopulations are present in the data, it seems plausi- 
ble that these two antibodies are quite highly similar 
because in the end they are still responding to breast 
cancer. The variations that lead to different subtypes 
may in fact be what makes the two positive latent fac- 
tors separate and distinct from one another. The high 
covariance and multicollinearity may be a function of 
the fact that the two different subtypes are still breast 
cancer. The multicollinearity may be because they vary 
together, not that they have a similar sequence and see 
the same antigen. If two different antigens consistently 
arose in a tumor they would raise antibodies that varied 
together in samples but would see totally different 
antigens. 

A second possibility is that this is modeling different 
times in the disease progression. As disease progresses it 
is likely different antigens are presented by the tumor to 
the immune system. If so, the relative amount of parti- 
cular marker antibodies will also change. 

Discussion 

We have explored a number of statistical models for 
analyzing immunosignatures. Each method explored 
herein helps answer a different research question relat- 
ing to the analysis of immunosignatures. Descriptive sta- 
tistics about an immunosignature can provide high level 
information about the general immune response in a 
signature. Exploratory factor analytic models (PCA and 
EFA) can be useful for classifying immunosignatures 
into different disease groups without any clinical infor- 
mation. CFA models can classify samples onto specified 
factors and could be developed into a useful model for 
determining the disease status of a single sample. As an 
extension, SEM models find some interesting and robust 
latent factor structure to immunosignatures which war- 
rant further investigation. 

Implications of SEM Models 

Latent factors can be reliably extracted from immuno- 
signatures. These latent factors are clear, consistent and 
replicable patterns which differentiate disease state in a 
statistically significant fashion. At the very minimum, 
these latent factors can serve as strong biomarkers for 
disease. Given the design of the technology and the fact 
that antibodies are binding to peptides on immuno- 
signature arrays, it is highly plausible that the latent fac- 
tors are modeling individual antibodies. 

Although future research is needed to conclusively 
confirm the relationship between modeled latent factors 



and antibodies, the potential of having a high-through- 
put bioinformatics-driven method for antibody discovery 
creates countless potential avenues for future applica- 
tions. The primary benefit of this methodological 
approach is to reduce the time it takes to identify anti- 
bodies associated with various clinical situations. Doing 
so will reduce cost and increase the speed of advance- 
ment in biomedicine. Additionally, the reduced cost and 
speed may open doors that were beyond the realm of 
consideration just a short time ago. For example, a 
method for quickly and inexpensively detecting an anti- 
body could play a crucial first step in developing perso- 
nalized vaccines. 

Below we present a multi-step procedure for detecting 
latent factors and potentially antibodies in an immuno- 
signaturing study. The first step is to run an exploratory 
factor analysis on the data with rotation. Various rota- 
tions can be explored but Promax or Geomin are 
recommended. EFA models are a useful starting place 
for multiple reasons. First, it ensures that the groups are 
different constructs and significantly different from one 
another. This determination can be made by looking at 
scree plots and eigenvalues to assess the probable num- 
ber of groups in the model; which should be equal to 
the number of known disease states. The samples 
should load correctly on a given factor with a high clas- 
sification rate. 

At this point, cross loadings in an EFA model can be 
investigated. If clinical data exists, it would be of use to 
try to assess if there are potential reasons for why a spe- 
cific sample may be cross loading. For example, is there 
a history of cancer in a normal sample that cross loads 
on a cancer sample which might suggest the person is 
in a transition phase? This may be a way of detecting 
aberrant cases or outliers. That said, haphazardly remov- 
ing cases from a dataset is NOT advocated in any fash- 
ion. Cross loadings were not analyzed in this paper due 
to a lack of additional information and clinical data 
upon which to draw any relevant conclusions. 

From here, an appropriate test statistic comparing the 
groups can be run on the peptides in order to test for 
statistical significance. T-Tests or logistic regression and 
their multivariate extensions ANOVA and multinomial 
logistic regression are a few potential methodological 
tools. The specific test should be picked with respect to 
the features of the data being analyzed. For example, in 
this paper, we used a Satterthwaite corrected T-Test 
because of unequal samples sizes and variances. A cor- 
rection should be made to protect against alpha infla- 
tion. Although a number of tests exist for this purpose, 
the Bonferroni correction is among the most common; 
even if it may be somewhat conservative. 

A traditional EFA model or an exploratory factor mix- 
ture model can be used to infer the structure of the 
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significant peptides within each group. This information 
can be used to create a full structural equation model. 
However, as part of good model building practices, 
starting with a CFA measurement model is recom- 
mended; especially because the iterative nature and 
complexity of these models may lead to convergence 
problems. Additionally, information from these simpler 
models can be used to specify starting values in full 
SEM models if convergence problems occur. CFA mea- 
surement models specify which peptides load together 
on a given latent factor. Checking the fit of the mea- 
surement models can confirm the accuracy of the 
model. However, given that CFA is so similar to EFA 
methods, it is unlikely that differing results would be 
obtained. 

Once a working measurement model has been 
obtained, a full SEM model can be created by regressing 
the latent factors on disease state. It is important to test 
a full SEM model for a number of reasons. Although 
EFA and CFA models may suggest that a group of sig- 
nificant peptides are related in some way, without a full 
SEM model, there is no way of knowing whether the 
relationship is a significant predictor of a specific disease 
state. In the absence of predictive validity for a specified 
disease state, any relationship among the peptides is tri- 
vial and would not suggest that it is because of a com- 
mon antibody. The same conclusion can made if the 
latent factor is predictive of disease states beyond the 
hypothesized state. 

If a significant SEM model can be obtained, wet lab 
validation can then attempt to determine if the model is 
correct. One potential way of testing this in the wet lab 
would be to use the designated peptides to affinity pur- 
ify the antibody from the sera. The prediction is that 
the different peptides would purify the same antibody. 
This could be tested by immunosignaturing the antibo- 
dies purified. 

Screening and Presymptomatic Screening for Disease 

The relative ease from which samples can be classified 
and differentiated with all of the methods explored 
herein makes this technology an excellent use for dis- 
ease screening. Whether examining the loadings of new 
samples in a CFA model or as part of a larger SEM 
model, this technology can allow researchers to screen 
patients in a variety of contexts. This initial research 
suggests that immunosignaturing could be developed 
into a quick and inexpensive method of screening for 
cancer. Taking a small sera sample from an individual is 
much less expensive and intrusive than traditional 
screening methods such as mammograms. One early 
potential use for immunosignaturing would be to help 
follow at risk populations; such as those individuals with 
a family history of cancer. Immunosignatures could be 



taken at regular intervals between regularly scheduled 
mammograms. If the generated immunosignature from 
an interim test started to suggest a closer similarity to 
cancer, this could prompt physicians to follow the 
patient more closely or advise additional screening. 
Immunosignatures could be used in the same way for 
individuals who already have cancer. In this case, if an 
immunosignature suggested the person was developing 
an antibody signature indicative of a second tumor (or 
more closely loading on a latent factor biomarker), the 
individual could be followed more closely to detect the 
presence of a second primary tumor. 

Screening for a specific disease state is fairly straight- 
forward. A well curated collection of disease samples 
would form baseline control factors. A sera sample 
would be taken from an individual and their sample 
would be allowed to freely load in a CFA model across 
relevant disease conditions. A significant loading on a 
disease factor would provide strong evidence for the 
person having a given disease. 

There are a number of ways in which a presympto- 
matic screening test could be developed from immuno- 
signatures. This could be done by collecting a 
longitudinal or time series sample of sera from an indi- 
vidual and following the factor loadings on a disease 
state over time. As the loadings on a disease factor tend 
to increase the individual could be watched more closely 
and additional screening for a disease could be recom- 
mended by a physician. A number of statistical methods 
and time series analyses such as latent transition analysis 
(LTA) could be employed to model this. 

Conclusion 

Immunosignaturing is a novel approach for understand- 
ing disease. A number of statistical methods including, 
exploratory factor analysis, confirmatory factor analysis, 
descriptive statistics, corrected t-tests, ANOVA, logistic 
and multinomial logistic regression, mixture models and 
structural equation modeling have shown promising 
abilities for analyzing different dimensions of immuno- 
signatures. Immunosignaturing in the context of breast 
cancer has been shown to be a good platform for differ- 
entiating groups of samples based on disease status, 
determining the disease status of specific samples as 
well as potentially serving a role in the discovery of anti- 
bodies for specific diseases. 

Despite many new challenges posed by immunosigna- 
turing microarrays such as competitive binding and 
binding to multiple sites, the analyses conducted here 
clearly illustrate the usefulness of classical analytical 
methods to produce accurate results. The results are 
particularly noteworthy because of the lack of structure 
in the data and lack of a full pattern matched experi- 
mental design. The early results of structural equation 



Brown et al. BMC Bioinformatics 201 1, 12:349 
http://www.biomedcentral.eom/1 471 -21 05/1 2/349 



Page 15 of 15 



modeling are very promising. Although wet lab valida- 
tion is needed for the proposed methodology of anti- 
body discovery, even if the latent factors turn out not to 
be a specific antibody, the model can still serve as an 
excellent biosignatures for disease screening. 

Early detection of cancer is among the best predictors 
of survival. Continued development of immunosignatur- 
ing into a screening and presymptomatic screening diag- 
nostic tool will aid in early discovery and help turn the 
corner in the fight against cancer. Future research in 
this field should aim at validating the hypothesis that 
the latent factors modeled here are in fact antibodies 
and to develop the technology into a diagnostic screen- 
ing tool. 
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